The following packages are required for this practical:
library(mice)
library(dplyr)
library(magrittr)
library(stringr)
library(ggplot2)
Function plot() is the core plotting function in
R. Find out more about plot(): Try both the
help in the help-pane and ?plot in the console. Look at the
examples by running example(plot).
The help tells you all about a function's arguments (the input you can specify), as well as the element the function returns to the Global Environment. There are strict rules for publishing packages in R. For your packages to appear on the Comprehensive R Archive Network (CRAN), a rigorous series of checks has to be passed. As a result, all user-level components (functions, datasets, elements) that are published have accompanying documentation that elaborates how the function should be used, what can be expected, or what type of information a data set contains. Help files often contain example code that can be run to demonstrate the workings.
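A function's arguments can also be listed from the console without opening the help page. The sketch below uses base R's args() and formals(); picking plot.default (the default method behind plot()) as the example is our choice, not part of the practical.

```r
# Print the argument list of the default plot method
args(graphics::plot.default)

# formals() returns the arguments as a named list,
# so their names can be inspected like any other vector
plot_args <- names(formals(graphics::plot.default))
head(plot_args)
```

This shows only the argument names and defaults; the help page remains the place to read what each argument means.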
?plot
## Help on topic 'plot' was found in the following packages:
##
## Package Library
## graphics /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
## base /Library/Frameworks/R.framework/Resources/library
##
##
## Using the first match ...
example(plot)
##
## plot> Speed <- cars$speed
##
## plot> Distance <- cars$dist
##
## plot> plot(Speed, Distance, panel.first = grid(8, 8),
## plot+ pch = 0, cex = 1.2, col = "blue")
##
## plot> plot(Speed, Distance,
## plot+ panel.first = lines(stats::lowess(Speed, Distance), lty = "dashed"),
## plot+ pch = 0, cex = 1.2, col = "blue")
##
## plot> ## Show the different plot types
## plot> x <- 0:12
##
## plot> y <- sin(pi/5 * x)
##
## plot> op <- par(mfrow = c(3,3), mar = .1+ c(2,2,3,1))
##
## plot> for (tp in c("p","l","b", "c","o","h", "s","S","n")) {
## plot+ plot(y ~ x, type = tp, main = paste0("plot(*, type = \"", tp, "\")"))
## plot+ if(tp == "S") {
## plot+ lines(x, y, type = "s", col = "red", lty = 2)
## plot+ mtext("lines(*, type = \"s\", ...)", col = "red", cex = 0.8)
## plot+ }
## plot+ }
##
## plot> par(op)
##
## plot> ##--- Log-Log Plot with custom axes
## plot> lx <- seq(1, 5, length.out = 41)
##
## plot> yl <- expression(e^{-frac(1,2) * {log[10](x)}^2})
##
## plot> y <- exp(-.5*lx^2)
##
## plot> op <- par(mfrow = c(2,1), mar = par("mar")-c(1,0,2,0), mgp = c(2, .7, 0))
##
## plot> plot(10^lx, y, log = "xy", type = "l", col = "purple",
## plot+ main = "Log-Log plot", ylab = yl, xlab = "x")
##
## plot> plot(10^lx, y, log = "xy", type = "o", pch = ".", col = "forestgreen",
## plot+ main = "Log-Log plot with custom axes", ylab = yl, xlab = "x",
## plot+ axes = FALSE, frame.plot = TRUE)
##
## plot> my.at <- 10^(1:5)
##
## plot> axis(1, at = my.at, labels = formatC(my.at, format = "fg"))
##
## plot> e.y <- -5:-1 ; at.y <- 10^e.y
##
## plot> axis(2, at = at.y, col.axis = "red", las = 1,
## plot+ labels = as.expression(lapply(e.y, function(E) bquote(10^.(E)))))
##
## plot> par(op)
There are many more functions that can plot specific types of plots.
For example, function hist() plots histograms, but falls
back on the basic plot() function. Packages
lattice and ggplot2 are excellent packages to
use for complex plots. Pretty much any type of plot can be made in R. A
good reference for package lattice, with all R code included,
can be found at http://lmdvr.r-forge.r-project.org/figures/figures.html.
Alternatively, all ggplot2 documentation can be found at http://docs.ggplot2.org/current/.
Create a scatter plot between age and bmi
in the mice::boys data set. Use the base R plotting
device.
With the standard plotting device in R:
plot( boys$bmi ~ boys$age )
Create a scatter plot between age and bmi
in the mice::boys data set. Use ggplot.
The point geom (geom_point()) is used to create
scatterplots.
p <- ggplot( data = boys, aes(age, bmi))
p + geom_point()
## Warning: Removed 21 rows containing missing values (`geom_point()`).
Package ggplot2 offers far greater flexibility in data
visualization than the standard plotting devices in R.
However, it has its own language, which allows you to easily expand
graphs with additional commands. To make these expansions or layers
clearly visible, it is advisable to use the plotting language
conventions. For example,
ggplot( data = mice::boys, aes(age, bmi)) +
geom_point()
would yield the same plot as
ggplot(mice::boys, aes(age, bmi)) + geom_point()
but the latter style may be less informative, especially if more customization takes place and if you share your code with others.
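To illustrate why the one-layer-per-line convention helps, here is a minimal sketch that extends the scatter plot with a few extra layers. The axis labels, the transparency value and the theme are illustrative choices of ours, not requirements of the exercise.

```r
library(ggplot2)

# One layer or setting per line: each addition is easy to spot, comment out or reorder
p <- ggplot(data = mice::boys, aes(age, bmi)) +
  geom_point(alpha = 0.4, na.rm = TRUE) +   # na.rm = TRUE drops missing values silently
  labs(x = "Age (years)", y = "BMI") +      # readable axis labels
  theme_minimal()                           # a lighter built-in theme

p
```

Written on a single line, the same call would be much harder to scan or to modify one layer at a time.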
Create a histogram for age in the boys data
set. Use the base R plotting device.
With the standard plotting device in R:
hist(boys$age, breaks = 50)
The argument breaks = 50 overrides the default number of breaks between the
bars. Without it, the plot would be
hist(boys$age)
The title and axis label need to be fixed:
hist(boys$age, breaks = 50, xlab = "Age", main = "Histogram")
Create a histogram for age in the boys data
set. Use ggplot.
The geom geom_histogram() is used to create
histograms.
ggplot( data = boys ) +
geom_histogram(aes(age), binwidth = .4)
Please note that the plots from geom_histogram() and
hist() use different calculations for the bars (bins) and
hence may look slightly different.
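If the two histograms should use the same bins, one option is to let hist() compute the breaks and pass them on to geom_histogram() through its breaks argument. This is a sketch of that idea, not the approach taken in the practical; the object names are ours.

```r
library(ggplot2)

age <- mice::boys$age

# Let hist() compute the breaks without drawing anything
h <- hist(age, breaks = 50, plot = FALSE)

# Hand the same breaks to geom_histogram()
p <- ggplot(data.frame(age = age), aes(age)) +
  geom_histogram(breaks = h$breaks)

p
```

With shared breaks the bar heights of both plots should correspond, which makes the two versions easier to compare.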
Create a bar chart for reg in the boys data set. Use the
base R plotting device.
With a standard plotting device in R:
boys %$%
table(reg) %>%
barplot()
Create a bar chart for reg in the boys data set. Use
ggplot.
The geom geom_bar() is used to create bar charts.
ggplot( data = boys ) +
geom_bar(aes(reg))
Note that geom_bar() plots the NAs by default, whereas the
base R version omits them without warning, because table() drops
missing values before barplot() is called. If we do not want to plot the
NAs, then a simple filter() (see exercise 2)
on the boys data is an efficient solution.
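The difference in how missing values are handled can be made explicit. The sketch below counts reg with and without the NA category using table()'s useNA argument, and then filters the missing regions out before calling geom_bar(). The object names are ours.

```r
library(dplyr)
library(ggplot2)

boys <- mice::boys

counts_default <- table(boys$reg)                   # NAs are dropped silently
counts_with_na <- table(boys$reg, useNA = "ifany")  # NAs counted as their own category
counts_with_na

# Remove rows with a missing region before plotting
boys_observed <- boys %>%
  filter(!is.na(reg))

p <- ggplot(data = boys_observed) +
  geom_bar(aes(reg))
p
```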
Create a box plot for hgt with different boxes for
reg in the boys data set. Use the base R
plotting device.
With a standard plotting device in R:
boys %$%
boxplot(hgt ~ reg)
Create a box plot for hgt with different boxes for
reg in the boys data set. Use
ggplot.
The geom geom_boxplot() is used to create box plots.
ggplot( data = boys, aes(reg, hgt)) +
geom_boxplot()
## Warning: Removed 20 rows containing non-finite values (`stat_boxplot()`).
Create a density plot for age with different curves for
boys from the city and boys from rural areas
(!city). Use the base R plotting device.
With a standard plotting device in R:
d1 <- boys %>%
subset(reg == "city") %$%
density(age)
d2 <- boys %>%
subset(reg != "city") %$%
density(age)
plot(d1, col = "red", ylim = c(0, .08))
lines(d2, col = "blue")
The above plot can also be generated without pipes, but this results in an
ugly main title. You may edit the title via the
main argument in the plot() function.
plot(density(boys$age[!is.na(boys$reg) & boys$reg == "city"]),
col = "red",
ylim = c(0, .08))
lines(density(boys$age[!is.na(boys$reg) & boys$reg != "city"]),
col = "blue")
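As a sketch of how the title could be tidied up, the version below stores the two selections in helper vectors, sets main and xlab explicitly, and adds a legend. The title text, axis label and legend position are illustrative choices of ours.

```r
boys <- mice::boys

# Logical selectors; rows with a missing region belong to neither group
is_city  <- !is.na(boys$reg) & boys$reg == "city"
is_rural <- !is.na(boys$reg) & boys$reg != "city"

plot(density(boys$age[is_city]),
     col  = "red",
     ylim = c(0, .08),
     main = "Density of age by area",   # replaces the automatically generated title
     xlab = "Age")
lines(density(boys$age[is_rural]), col = "blue")
legend("topright", legend = c("city", "rural"),
       col = c("red", "blue"), lty = 1)
```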
Create a density plot for age with different curves for
boys from the city and boys from rural areas
(!city). Use ggplot.
Create a new variable that indicates whether the area is urban or rural. Filter out missing values.
The geom geom_density() is used to create density
plots.
With ggplot2:
boys %>%
  mutate(area = ifelse(reg == "city", "city", "rural")) %>%
  filter(!is.na(area)) %>%
  ggplot(aes(age, fill = area)) +
  geom_density(alpha = .3) # some transparency
Now recreate the plot from Exercise 1B with the following specifications:
bmi < 18.5 use color = "light blue"
bmi > 18.5 & bmi < 25 use color = "light green"
bmi > 25 & bmi < 30 use color = "orange"
bmi > 30 use color = "red"
It may help to expand the data set with a new variable.
It may be easier to create a new variable that creates the specified
categories. We can use mutate() and the cut()
function to do this quickly
boys2 <-
  boys %>%
  mutate(class = cut(bmi, c(0, 18.5, 25, 30, Inf),
                     labels = c("underweight",
                                "healthy",
                                "overweight",
                                "obese")))
by specifying the boundaries of the intervals. In this case we obtain
4 intervals: 0-18.5, 18.5-25,
25-30 and 30-Inf. We can now call
ggplot
ggplot(data = boys2) +
geom_point(aes(age, bmi, col = class))
## Warning: Removed 21 rows containing missing values (`geom_point()`).
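To see how cut() treats values that fall exactly on a boundary, it can be tried on a few hand-picked numbers. The values below are made up for illustration and do not come from the boys data.

```r
# Hypothetical BMI values, including the boundaries themselves and a missing value
bmi_values <- c(17, 18.5, 22, 25, 27.5, 30, 35, NA)

classes <- cut(bmi_values, c(0, 18.5, 25, 30, Inf),
               labels = c("underweight", "healthy", "overweight", "obese"))

data.frame(bmi = bmi_values, class = classes)
```

By default the intervals are closed on the right, so a value of exactly 18.5 is classified as underweight and a missing value stays missing. Passing right = FALSE to cut() would instead close the intervals on the left.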
Although the different classifications have different colours, the colours do not conform to the specifications of this exercise. We can manually override this:
ggplot(data = boys2 ) +
geom_point(aes(age, bmi, col = class)) +
scale_color_manual(values = c("light blue", "light green", "orange", "red"))
## Warning: Removed 21 rows containing missing values (`geom_point()`).
Because there are missing values, ggplot2 displays a
warning message. If we would rather not consider the missing values
when plotting, we can simply exclude the NAs by filtering the
rows, here with base R indexing:
ggplot( data = boys2[ !is.na( boys2$class ), ] ) +
geom_point(aes(age, bmi, col = class)) +
scale_color_manual(values = c("light blue", "light green", "orange", "red"))
Specifying a filter on the feature class is sufficient:
age has no missings and the missings in class directly
correspond to missing values on bmi. Filtering on
bmi would therefore yield an identical plot.
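The claim that the missing values in class coincide with those in bmi can be checked directly. A minimal sketch, rebuilding the class variable with the same cut() call:

```r
library(dplyr)

boys2 <- mice::boys %>%
  mutate(class = cut(bmi, c(0, 18.5, 25, 30, Inf),
                     labels = c("underweight", "healthy",
                                "overweight", "obese")))

# TRUE if class is missing for exactly the rows where bmi is missing
same_missing <- identical(is.na(boys2$class), is.na(boys2$bmi))
same_missing

# Number of missing values in age
sum(is.na(boys2$age))
```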
Create a diverging bar chart for hgt in the
boys data set, that displays for every age
year that year's mean height in deviations from the overall average
hgt. Use ggplot.
First create 2 new variables: one for the height deviations and one for the (22?) age categories.
boys %>%
mutate(Hgt = hgt - mean(hgt, na.rm = TRUE),
Age = cut(age, 0:22, labels = 0:21))
Find the mean height deviations in each age group
group_by(Age)
summarize(Hgt = mean(Hgt, na.rm = TRUE))
Define a new variable that indicates whether the height deviation is
below or above average. You can apply cut().
mutate(Diff = cut(Hgt, c(-Inf, 0, Inf),
labels = c("Below Average", "Above Average")))
boys %>%
  mutate(Hgt = hgt - mean(hgt, na.rm = TRUE),
         Age = cut(age, 0:22, labels = 0:21)) %>%
  group_by(Age) %>%
  summarize(Hgt = mean(Hgt, na.rm = TRUE)) %>%
  mutate(Diff = cut(Hgt, c(-Inf, 0, Inf),
                    labels = c("Below Average", "Above Average"))) %>%
  ggplot(aes(x = Age, y = Hgt, fill = Diff)) +
  geom_bar(stat = "identity") +
  coord_flip()
We can clearly see that the average height in the group is reached just before age 7.
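The mechanics of a diverging bar chart are easier to see on a small example. The sketch below uses a made-up data frame of four deviations, and uses geom_col(), which ggplot2 provides as a shorthand for geom_bar(stat = "identity"). The toy values are hypothetical and only serve as illustration.

```r
library(ggplot2)

# Hypothetical deviations from an overall mean, one per age group
toy <- data.frame(Age = factor(0:3),
                  Hgt = c(-30, -10, 5, 25))

# Label each deviation by its sign
toy$Diff <- cut(toy$Hgt, c(-Inf, 0, Inf),
                labels = c("Below Average", "Above Average"))

p <- ggplot(toy, aes(x = Age, y = Hgt, fill = Diff)) +
  geom_col() +      # bar heights are taken directly from Hgt
  coord_flip()      # horizontal bars diverging from zero
p
```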
The group_by() and summarize() functions are
advanced dplyr functions, used here to return the
mean() of deviation Hgt for every group in
Age. For example, if we would like the mean and sd of
height hgt for every region reg in the
boys data, we could call:
boys %>%
group_by(reg) %>%
summarize(mean_hgt = mean(hgt, na.rm = TRUE),
sd_hgt = sd(hgt, na.rm = TRUE))
## # A tibble: 6 × 3
## reg mean_hgt sd_hgt
## <fct> <dbl> <dbl>
## 1 north 152. 43.8
## 2 east 134. 43.2
## 3 west 130. 48.0
## 4 south 128. 46.3
## 5 city 126. 46.9
## 6 <NA> 73.0 29.3
The na.rm argument ensures that the mean and sd of only
the observed values in each category are used.
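The effect of na.rm is easiest to see on a small made-up data set in which one group contains a missing value. The tibble below is hypothetical and only serves as illustration.

```r
library(dplyr)

# Hypothetical data: group "a" has one missing value
toy <- tibble(group = c("a", "a", "b", "b"),
              value = c(1, NA, 3, 5))

res <- toy %>%
  group_by(group) %>%
  summarize(mean_all      = mean(value),                # NA as soon as one value is missing
            mean_observed = mean(value, na.rm = TRUE),  # mean of the observed values only
            n_observed    = sum(!is.na(value)))         # how many values were used
res
```

Reporting the number of observed values next to each mean makes it clear how much data every summary is based on.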
Load the sf package, and open the shapefiles on the
Danish municipalities from the course homepage. Plot the
REGIONNAVN variable to see the Danish regions. Plot the
municipal-level population. Use ggplot.
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
denmark <- st_read("DK_map.shp")
## Reading layer `DK_map' from data source
## `/Users/mikkelmollerup/Dropbox/Work/RWORK/DSTStuff/Vietnam/RVietnam/Contents/Part_E/DK_map.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 306 features and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 441524.8 ymin: 6049785 xmax: 892800.8 ymax: 6402308
## CRS: NA
class(denmark)
## [1] "sf" "data.frame"
ggplot( data = denmark ) + geom_sf()
Say we want to colour the maps by the administrative regions coded in
the REGIONNAVN variable. Also, we could move the legend
inside the plot and use a more colorblind-friendly color scale:
denmark %>%
  ggplot( aes( fill = REGIONNAVN ) ) +
  geom_sf() +
  theme(legend.position = c(0.8,0.7)) +
  scale_fill_brewer(palette = "Set2")
We plot the municipal-level population:
denmark %>% mutate( population = population/1000) %>%
ggplot(aes( fill = population)) +
geom_sf() +
scale_fill_viridis_c() + # The viridis color scale gives more visual nuance
labs( fill = "Population,\nthousands")
But perhaps population per square kilometer is more informative than just population:
denmark$area <- st_area(denmark)/(1000^2)
denmark %>%
  group_by(KOMKODE) %>%
  summarise( total.area = sum( as.numeric( area ) ),
             population = first( population )) %>%
  mutate( pop.area = population/total.area) %>%
  ggplot( aes( fill = pop.area)) +
  geom_sf() +
  scale_fill_viridis_c() + # The viridis color scale gives more visual nuance
  labs( fill = "Population per \nsquare kilometer")
Our sf object contains more than one feature for some of
the municipalities. The population number given is for the total
municipality, so we need to compute the total area for each
municipality.
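The aggregation step can be tried on a tiny hand-built sf object. In the sketch below, municipality "A" consists of two separate squares that both carry the municipality's total population, mirroring the situation described above. The helper square(), the codes and the population figures are all hypothetical.

```r
library(sf)
library(dplyr)

# Hypothetical helper: a square polygon with side `size` (in metres)
square <- function(x0, y0, size = 1000) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + size, y0),
                        c(x0 + size, y0 + size), c(x0, y0 + size),
                        c(x0, y0))))
}

# Municipality "A" has two features, "B" has one
toy <- st_sf(KOMKODE    = c("A", "A", "B"),
             population = c(5000, 5000, 1200),
             geometry   = st_sfc(square(0, 0), square(3000, 0), square(0, 3000)))

toy$area <- st_area(toy) / (1000^2)   # square metres to square kilometres

per_municipality <- toy %>%
  group_by(KOMKODE) %>%
  summarise(total.area = sum(as.numeric(area)),   # add up the area of all features
            population = first(population)) %>%   # population is repeated per feature
  mutate(pop.area = population / total.area)

per_municipality
```

Summing the population instead of taking the first value would count the inhabitants of "A" twice, which is why first() is used here.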
End of Practical